Anthropic 谈“我们是如何构建多智能体研究系统的”

发布于: 2025年6月14日

发表于 2025年6月13日

Our Research feature uses multiple Claude agents to explore complex topics more effectively. We share the engineering challenges and the lessons we learned from building this system.

我们的“研究”功能利用多个 Claude 智能体来更有效地探索复杂主题。我们分享了构建这一系统过程中所遇到的工程挑战,以及从中汲取的经验教训。

Claude now has Research capabilities that allow it to search across the web, Google Workspace, and any integrations to accomplish complex tasks.

Claude 现在具备“研究”功能,可以搜索整个网络、Google Workspace 以及已集成的任何系统或服务,以完成复杂任务。

The journey of this multi-agent system from prototype to production taught us critical lessons about system architecture, tool design, and prompt engineering. A multi-agent system consists of multiple agents (LLMs autonomously using tools in a loop) working together. Our Research feature involves an agent that plans a research process based on user queries, and then uses tools to create parallel agents that search for information simultaneously. Systems with multiple agents introduce new challenges in agent coordination, evaluation, and reliability.

这个多智能体系统从原型发展到生产环境的历程,让我们学到了关于系统架构、工具设计和提示工程的重要经验。一个多智能体系统由多个智能体组成(这些智能体是能够在循环中自主使用工具的大型语言模型),协同完成工作。我们的“研究”功能包含一个智能体,根据用户的查询规划研究流程,然后利用工具创建多个并行的子智能体来同时搜索信息。具有多个智能体的系统在智能体协调、评估和可靠性方面带来了新的挑战。

This post breaks down the principles that worked for us—we hope you'll find them useful to apply when building your own multi-agent systems.

本文分解介绍了对我们行之有效的原则——希望当你构建自己的多智能体系统时,会发现这些原则同样有用。

Benefits of a multi-agent system

多智能体系统的优势

Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent. When people conduct research, they tend to continuously update their approach based on discoveries, following leads that emerge during investigation.

研究工作往往涉及开放式的问题,事先很难预测所需的步骤。对于复杂主题的探索,你无法硬编码一条固定的路径,因为这个过程本质上是动态且路径依赖的。当人们进行研究时,他们往往会根据新的发现不断调整方法,沿着调查过程中出现的线索继续探索。

This unpredictability makes AI agents particularly well-suited for research tasks. Research demands the flexibility to pivot or explore tangential connections as the investigation unfolds. The model must operate autonomously for many turns, making decisions about which directions to pursue based on intermediate findings. A linear, one-shot pipeline cannot handle these tasks.

这种不可预测性使得 AI 智能体特别适合执行研究任务。研究要求在调查展开过程中具备灵活性,能够随时改变方向或探索一些旁支线索。模型必须能够自主运行多个回合,根据中间发现决定要追寻哪些方向。线性的一次性流程无法胜任这些任务。

The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.

搜索的本质是压缩:从庞大的语料库中提炼洞见。子智能体通过各自的上下文窗口并行工作,实现这一压缩过程。它们可以同时探索问题的不同方面,然后将最重要的内容提炼后提供给主研究智能体。每个子智能体也实现了关注点的分离——使用不同的工具、提示和探索路径——这减少了路径依赖,并使调查能够更加全面、独立地进行。

Once intelligence reaches a threshold, multi-agent systems become a vital way to scale performance. For instance, although individual humans have become more intelligent in the last 100,000 years, human societies have become exponentially more capable in the information age because of our collective intelligence and ability to coordinate. Even generally-intelligent agents face limits when operating as individuals; groups of agents can accomplish far more.

一旦智能达到了某个阈值,多智能体系统就成为扩展性能的关键途径。举例来说,尽管在人类过去十万年中个体变得越来越聪明,但在信息时代,由于集体智慧和协作能力,人类社会的能力已经呈现出指数级的提升。即使具备通用智能的智能体在独立运行时也有其局限,而一群智能体可以完成更多的事情。

Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single agent system failed to find the answer with slow, sequential searches.

我们的内部评估显示,多智能体研究系统在需要同时探索多个独立方向的广度优先式查询上表现尤为出色。在我们的内部研究评估中,一个以 Claude Opus 4 为主智能体、并使用 Claude Sonnet 4 作为子智能体的多智能体系统,比单一 Claude Opus 4 智能体的表现高出了 90.2%。例如,当被问及“找出信息技术类标准普尔 500 指数成分公司所有的董事会成员”时,多智能体系统通过将该任务分解给多个子智能体来找到了正确答案,而单智能体系统在缓慢的串行搜索中未能找到答案。

Multi-agent systems work mainly because they help spend enough tokens to solve the problem. In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors. This finding validates our architecture that distributes work across agents with separate context windows to add more capacity for parallel reasoning. The latest Claude models act as large efficiency multipliers on token use, as upgrading to Claude Sonnet 4 is a larger performance gain than doubling the token budget on Claude Sonnet 3.7. Multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents.

多智能体系统之所以奏效,主要是因为它们能帮助投入足够多的 token 来解决问题。在我们的分析中,有三个因素可以解释 BrowseComp 评测中 95% 的性能差异(BrowseComp 评测旨在测试浏览型智能体定位难以找到的信息的能力)。我们发现,仅 token 的使用量就解释了其中 80% 的差异,另外两个因素是调用工具的次数和模型的选择。这一发现验证了我们的体系结构:通过让多个具有各自上下文窗口的智能体分担工作,为并行推理增加了容量。最新的 Claude 模型大大提高了 token 的利用效率,例如将 Claude Sonnet 3.7 升级到 Claude Sonnet 4 带来的性能提升,要比将 Claude Sonnet 3.7 的 token 配额加倍所获得的提升更大。多智能体架构有效扩展了 token 的使用,从而让系统可以应对超出单个智能体能力极限的任务。

There is a downside: in practice, these architectures burn through tokens fast. In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats. For economic viability, multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance. Further, some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.

但多智能体架构也有一个缺点:在实际中,这些架构会非常快地消耗大量 token。根据我们的数据,智能体通常比普通聊天交互多使用约 4 倍的 token,而多智能体系统使用的 token 数量约为聊天的 15 倍。为了在经济上可行,多智能体系统需要用于那些任务价值足够高的场景,以抵消提升性能所增加的开销。此外,一些需要所有智能体共享同一上下文或智能体之间存在大量依赖的领域,目前并不适合采用多智能体系统。举例来说,大多数编程任务相比研究而言可并行执行的子任务要少得多,而且 LLM 智能体在实时协调和将任务委派给其他智能体方面还不够擅长。我们的发现是,多智能体系统非常擅长处理那些高价值的任务,这类任务通常需要高度并行、信息量超出单个上下文窗口限制,并且需要与众多复杂工具交互。

Architecture overview for Research

“研究”功能的架构概览

Our Research system uses a multi-agent architecture with an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel.

我们的“研究”系统采用多智能体架构,遵循协调者-工作者模式:由一个主智能体协调整个过程,并将任务委派给并行运行的专门子智能体。

image.png

The multi-agent architecture in action: user queries flow through a lead agent that creates specialized subagents to search for different aspects in parallel.

多智能体架构的运作示意:用户的查询由一个主智能体处理,该主智能体创建专门的子智能体来并行搜索不同方面的信息。

When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. As shown in the diagram above, the subagents act as intelligent filters by iteratively using search tools to gather information, in this case on AI agent companies in 2025, and then returning a list of companies to the lead agent so it can compile a final answer.

当用户提交一个查询后,主智能体会进行分析,制定策略,并生成多个子智能体以同时探索不同的方面。如上图所示,子智能体通过不断使用搜索工具来收集信息,发挥着智能过滤器的作用。在这里,它们针对 2025 年的 AI 智能体公司进行搜索,然后将获得的公司列表返回给主智能体,以便主智能体汇总出最终答案。

Traditional approaches using Retrieval Augmented Generation (RAG) use static retrieval. That is, they fetch some set of chunks that are most similar to an input query and use these chunks to generate a response. In contrast, our architecture uses a multi-step search that dynamically finds relevant information, adapts to new findings, and analyzes results to formulate high-quality answers.

传统的检索增强生成(RAG)方法使用的是静态检索。也就是说,它们会提取与输入查询最相似的一组片段,并利用这些片段来生成回答。相比之下,我们的架构使用多步骤的动态搜索,可以找到相关信息,随新的发现进行调整,并分析结果以形成高质量的答案。

image.png

Process diagram showing the complete workflow of our multi-agent Research system. When a user submits a query, the system creates a LeadResearcher agent that enters an iterative research process. The LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan. It then creates specialized Subagents (two are shown here, but it can be any number) with specific research tasks. Each Subagent independently performs web searches, evaluates tool results using interleaved thinking, and returns findings to the LeadResearcher. The LeadResearcher synthesizes these results and decides whether more research is needed—if so, it can create additional subagents or refine its strategy. Once sufficient information is gathered, the system exits the research loop and passes all findings to a CitationAgent, which processes the documents and research report to identify specific locations for citations. This ensures all claims are properly attributed to their sources. The final research results, complete with citations, are then returned to the user.

这个流程图展示了我们的多智能体“研究”系统的完整工作流程。当用户提交一个查询后,系统将创建一个 LeadResearcher 智能体(主研究者智能体),让其进入迭代的研究流程。LeadResearcher 首先会规划研究思路,并将其计划保存到 Memory(内存模块)中以持久化上下文。因为如果上下文窗口超过 200,000 个 token 就会被截断,所以保留研究计划非常重要。随后,它会创建具备特定研究任务的专门子智能体(此处示例显示了两个子智能体,但实际上数量可以不受限)。每个子智能体独立执行网页搜索,在获取工具结果后使用“交错思考”模式评估结果质量,然后将发现反馈给 LeadResearcher。LeadResearcher 会综合这些结果并决定是否需要进行更多研究——如果需要,它可以创建额外的子智能体或优化其策略。一旦收集到足够的信息,系统就退出研究循环,并将所有发现交给 CitationAgent。CitationAgent 会处理这些文档和研究报告,找出需要引用的具体位置,确保所有论述都被正确地归因到相应来源。最终,带有完整引用的研究结果将返回给用户。

Prompt engineering and evaluations for research agents

针对研究型智能体的提示工程和评估

Multi-agent systems have key differences from single-agent systems, including a rapid growth in coordination complexity. Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors. Below are some principles we learned for prompting agents:

多智能体系统与单智能体系统存在一些关键差异,其中之一是协调复杂度会迅速增加。早期的智能体犯过一些错误,例如针对一个简单查询却生成了 50 个子智能体、无休止地在网上搜索并不存在的来源、或者由于过度频繁的更新而彼此干扰。由于每个智能体的行为都是由提示词引导的,因此提示工程成为我们改进这些行为的主要手段。以下是我们在为智能体设计提示时总结出的原则:

  1. Think like your agents. To iterate on prompts, you must understand their effects. To help us do this, we built simulations using our Console with the exact prompts and tools from our system, then watched agents work step-by-step. This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent, which can make the most impactful changes obvious.

    像你的智能体一样思考。 要迭代优化提示词,你必须理解其产生的效果。为此,我们在 Console 中搭建了仿真环境,使用系统中完全相同的提示和工具,然后观察智能体一步步地执行任务。这立即暴露出了多种失败模式:智能体在已经获得足够结果时仍继续执行、使用过于冗长的搜索查询、或者选择了错误的工具。有效的提示设计依赖于为智能体建立准确的心理模型,这会让那些最具影响力的修改变得显而易见。

  2. Teach the orchestrator how to delegate. In our system, the lead agent decomposes queries into subtasks and describes them to subagents. Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries. Without detailed task descriptions, agents duplicate work, leave gaps, or fail to find necessary information. We started by allowing the lead agent to give simple, short instructions like 'research the semiconductor shortage,' but found these instructions often were vague enough that subagents misinterpreted the task or performed the exact same searches as other agents. For instance, one subagent explored the 2021 automotive chip crisis while 2 others duplicated work investigating current 2025 supply chains, without an effective division of labor.

    教会协调者如何委派任务。 在我们的系统中,主智能体会将查询分解为子任务,并将这些任务描述给子智能体。每个子智能体都需要明确的目标、输出格式、所用工具和信息来源的指引,以及清晰的任务边界。如果缺少详细的任务描述,智能体可能会重复工作、留下空白,或者无法找到所需的信息。起初,我们允许主智能体给出诸如“研究半导体短缺”这样简单而简短的指令,但我们发现这些指令往往过于模糊,导致子智能体误解任务或执行与其他智能体完全相同的搜索。例如,一个子智能体去探索了 2021 年的汽车芯片危机,而另外两个子智能体则重复调查了 2025 年当前的供应链,分工极不明确。

  3. Scale effort to query complexity. Agents struggle to judge appropriate effort for different tasks, so we embedded scaling rules in the prompts. Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities. These explicit guidelines help the lead agent allocate resources efficiently and prevent overinvestment in simple queries, which was a common failure mode in our early versions.

    根据查询的复杂度调整投入。 智能体难以判断不同任务所需的适当投入程度,因此我们在提示词中嵌入了投入规模的规则。简单的事实查找只需要 1 个智能体和 3–10 次工具调用;直接比较可能需要 2–4 个子智能体,每个子智能体调用工具 10–15 次;而复杂的研究可能需要使用超过 10 个子智能体,并为它们明确划分职责。这些明确的指南帮助主智能体高效地分配资源,避免在简单查询上投入过度——在我们的早期版本中,对简单查询投入过多是常见的失败模式。

  4. Tool design and selection are critical. Agent-tool interfaces are as critical as human-computer interfaces. Using the right tool is efficient—often, it’s strictly necessary. For instance, an agent searching the web for context that only exists in Slack is doomed from the start. With MCP servers that give the model access to external tools, this problem compounds, as agents encounter unseen tools with descriptions of wildly varying quality. We gave our agents explicit heuristics: for example, examine all available tools first, match tool usage to user intent, search the web for broad external exploration, or prefer specialized tools over generic ones. Bad tool descriptions can send agents down completely wrong paths, so each tool needs a distinct purpose and a clear description.

    工具的设计和选择至关重要。 智能体与工具的接口重要性不亚于人机接口。使用正确的工具是高效的——而且往往是绝对必要的。例如,一个智能体如果在网页上搜索实际上只存在于 Slack 中的上下文信息,那它从一开始就注定会失败。借助让模型能够访问外部工具的 MCP 服务器,这个问题会更加严重,因为智能体会遇到各种从未见过的工具,而这些工具的描述质量良莠不齐。我们为智能体提供了明确的启发式策略:例如,首先检查所有可用的工具;将工具的使用与用户意图相匹配;使用网络搜索进行广泛的外部探索;或者优先选择专用工具而非通用工具。糟糕的工具描述可能会让智能体走上完全错误的道路,因此每个工具都需要有明确的用途和清晰的描述。

  5. Let agents improve themselves. We found that the Claude 4 models can be excellent prompt engineers. When given a prompt and a failure mode, they are able to diagnose why the agent is failing and suggest improvements. We even created a tool-testing agent—when given a flawed MCP tool, it attempts to use the tool and then rewrites the tool description to avoid failures. By testing the tool dozens of times, this agent found key nuances and bugs. This process for improving tool ergonomics resulted in a 40% decrease in task completion time for future agents using the new description, because they were able to avoid most mistakes.

    让智能体自我改进。 我们发现 Claude 4 模型可以成为出色的提示工程师。当提供一个提示词和一种失败模式给它们时,它们能够诊断出智能体失败的原因并提出改进建议。我们甚至创建了一个测试工具的智能体——当给它一个有缺陷的 MCP 工具时,它会尝试使用该工具,然后改写工具描述以避免失败。通过对该工具进行数十次测试,这个智能体发现了关键的细微差别和漏洞。这种改进工具易用性的过程使未来智能体在使用新的工具描述时完成任务的时间减少了 40%,因为它们能够避免大多数错误。

  6. Start wide, then narrow down. Search strategy should mirror expert human research: explore the landscape before drilling into specifics. Agents often default to overly long, specific queries that return few results. We counteracted this tendency by prompting agents to start with short, broad queries, evaluate what’s available, then progressively narrow focus.

    先广后细。 搜索策略应当模仿人类专家的研究过程:先广泛探索全局,再深入具体细节。智能体常常倾向于使用过长且过于具体的查询,结果返回很少。我们通过提示智能体从简短、宽泛的查询开始,先评估可获取的信息,然后逐步缩小焦点,来对抗这种倾向。

  7. Guide the thinking process. Extended thinking mode, which leads Claude to output additional tokens in a visible thinking process, can serve as a controllable scratchpad. The lead agent uses thinking to plan its approach, assessing which tools fit the task, determining query complexity and subagent count, and defining each subagent’s role. Our testing showed that extended thinking improved instruction-following, reasoning, and efficiency. Subagents also plan, then use interleaved thinking after tool results to evaluate quality, identify gaps, and refine their next query. This makes subagents more effective in adapting to any task.

    引导思维过程。 扩展思考模式会引导 Claude 在可见的思考过程中输出额外的 token,可作为一个可控的草稿本。主智能体利用“思考”来规划自己的方案,评估哪些工具适合任务,确定查询的复杂度和子智能体的数量,并定义每个子智能体的角色。我们的测试表明,扩展思考提高了智能体对指令的遵循程度、推理能力和效率。子智能体也会先制定计划,然后在获得工具结果后使用“交错思考”来评估结果质量、发现信息缺口,并优化它们的下一步查询。这使得子智能体在适应任何任务时都更加高效。

  8. Parallel tool calling transforms speed and performance. Complex research tasks naturally involve exploring many sources. Our early agents executed sequential searches, which was painfully slow. For speed, we introduced two kinds of parallelization: (1) the lead agent spins up 3-5 subagents in parallel rather than serially; (2) the subagents use 3+ tools in parallel. These changes cut research time by up to 90% for complex queries, allowing Research to do more work in minutes instead of hours while covering more information than other systems.

    并行调用工具带来速度和性能的飞跃。 复杂的研究任务天生需要探索许多信息来源。我们早期的智能体采用串行搜索,速度非常慢。为提高速度,我们引入了两种并行化:(1) 主智能体同时启动 3–5 个子智能体,而不再是串行地逐个启动;(2) 子智能体并行使用 3 个以上的工具。这些改变使复杂查询的研究时间最多减少了 90%,让“研究”功能能够在几分钟而非数小时内完成更多工作,并且涵盖的信息量也超过了其他系统。

Our prompting strategy focuses on instilling good heuristics rather than rigid rules. We studied how skilled humans approach research tasks and encoded these strategies in our prompts—strategies like decomposing difficult questions into smaller tasks, carefully evaluating the quality of sources, adjusting search approaches based on new information, and recognizing when to focus on depth (investigating one topic in detail) vs. breadth (exploring many topics in parallel). We also proactively mitigated unintended side effects by setting explicit guardrails to prevent the agents from spiraling out of control. Finally, we focused on a fast iteration loop with observability and test cases.

我们的提示策略注重灌输良好的启发式方法,而非死板的规则。我们研究了熟练的人类是如何进行研究任务的,并将这些策略编码进了提示——例如,把困难的问题分解成更小的任务、仔细评估信息来源的质量、根据新的信息调整搜索方法,以及识别何时应该专注深度(深入研究一个主题)而非广度(并行探索多个主题)。我们还通过设置明确的防护措施来主动减轻意料之外的副作用,以防止智能体失控。最后,我们专注于构建一个具有可观测性和测试用例的快速迭代循环。

Effective evaluation of agents

对智能体的有效评估

Good evaluations are essential for building reliable AI applications, and agents are no different. However, evaluating multi-agent systems presents unique challenges. Traditional evaluations often assume that the AI follows the same steps each time: given input X, the system should follow path Y to produce output Z. But multi-agent systems don't work this way. Even with identical starting points, agents might take completely different valid paths to reach their goal. One agent might search three sources while another searches ten, or they might use different tools to find the same answer. Because we don’t always know what the right steps are, we usually can't just check if agents followed the “correct” steps we prescribed in advance. Instead, we need flexible evaluation methods that judge whether agents achieved the right outcomes while also following a reasonable process.

良好的评估对于构建可靠的 AI 应用至关重要,智能体也不例外。然而,评估多智能体系统带来了独特的挑战。传统评估常常假定 AI 每次都遵循相同的步骤:给定输入 X,系统应按照路径 Y 产生输出 Z。但多智能体系统并非如此运作。即使起点相同,不同智能体也可能采用完全不同但同样有效的路径来达到目标。一个智能体可能搜索了三个来源,而另一个搜索了十个;或者它们可能使用不同的工具找到相同的答案。由于我们并不总是知道正确的步骤是什么,因此通常无法简单检查智能体是否遵循了我们预先规定的“正确”步骤。相反,我们需要灵活的评估方法来判断智能体是否取得了正确的结果,同时也考察它们是否遵循了合理的过程。

Start evaluating immediately with small samples. In early agent development, changes tend to have dramatic impacts because there is abundant low-hanging fruit. A prompt tweak might boost success rates from 30% to 80%. With effect sizes this large, you can spot changes with just a few test cases. We started with a set of about 20 queries representing real usage patterns. Testing these queries often allowed us to clearly see the impact of changes. We often hear that AI developer teams delay creating evals because they believe that only large evals with hundreds of test cases are useful. However, it’s best to start with small-scale testing right away with a few examples, rather than delaying until you can build more thorough evals.

立即使用小样本开始评估。 在智能体开发的早期阶段,由于存在大量容易改进的空间,改动往往会产生巨大的影响。微调一个提示词就可能将成功率从 30% 提高到 80%。有如此大的效果差异,你只需要几个测试用例就能发现变化。我们最初使用了一组大约 20 个查询来代表真实的使用模式。测试这些查询往往能让我们清楚地看到改动的影响。我们经常听说一些 AI 开发团队推迟创建评估,因为他们认为只有包含数百个测试用例的大规模评估才有价值。然而,最好的做法是不必等待构建更全面的评估,而是立即用少量示例开始小规模测试。

LLM-as-judge evaluation scales when done well. Research outputs are difficult to evaluate programmatically, since they are free-form text and rarely have a single correct answer. LLMs are a natural fit for grading outputs. We used an LLM judge that evaluated each output against criteria in a rubric: factual accuracy (do claims match sources?), citation accuracy (do the cited sources match the claims?), completeness (are all requested aspects covered?), source quality (did it use primary sources over lower-quality secondary sources?), and tool efficiency (did it use the right tools a reasonable number of times?). We experimented with multiple judges to evaluate each component, but found that a single LLM call with a single prompt outputting scores from 0.0-1.0 and a pass-fail grade was the most consistent and aligned with human judgements. This method was especially effective when the eval test cases did have a clear answer, and we could use the LLM judge to simply check if the answer was correct (i.e. did it accurately list the pharma companies with the top 3 largest R&D budgets?). Using an LLM as a judge allowed us to scalably evaluate hundreds of outputs.

使用 LLM 作为评审的评估在做好后可以扩展规模。 研究类输出很难通过编程自动评估,因为它们是自由格式的文本,且很少有唯一正确的答案。LLM 非常适合对输出进行评分。我们使用了一个 LLM 担任评审,根据一套标准对每个输出进行评估:事实准确性(论断是否与来源相符?)、引用准确性(引用的来源是否支持该论断?)、完整性(请求的各方面是否都涵盖了?)、来源质量(是否使用了高质量的一手来源而非低质量的二手来源?)、工具效率(是否以合理的次数使用了正确的工具?)。我们尝试使用多个评审模型来评估每个部分,但发现通过单次 LLM 调用、使用一个提示同时输出 0.0–1.0 分数和通过/未通过评级的方法最为一致,并且与人工判断最为吻合。当评估测试用例确实有明确答案时,这种方法尤其有效——我们可以让 LLM 评审直接检查答案是否正确(例如,它列出的研发预算排名前三的制药公司是否准确)。使用 LLM 作为评审使我们能够大规模地评估数百个输出。

Human evaluation catches what automation misses. People testing agents find edge cases that evals miss. These include hallucinated answers on unusual queries, system failures, or subtle source selection biases. In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue. Even in a world of automated evaluations, manual testing remains essential.

人工评估可以发现自动化遗漏的问题。 人工测试智能体可以发现评估遗漏的边缘情况。其中包括智能体在非常规查询上产生幻觉式答案、系统故障,或者微妙的来源选择偏差。在我们的案例中,人类测试人员注意到,我们早期的智能体总是选择经过 SEO 优化的内容农场,而忽略那些权威但排名不高的来源,比如学术 PDF 或个人博客。在提示中加入来源质量的启发式规则帮助解决了这个问题。即使在评估自动化的时代,人工测试仍然是不可或缺的。

Multi-agent systems have emergent behaviors, which arise without specific programming. For instance, small changes to the lead agent can unpredictably change how subagents behave. Success requires understanding interaction patterns, not just individual agent behavior. Therefore, the best prompts for these agents are not just strict instructions, but frameworks for collaboration that define the division of labor, problem-solving approaches, and effort budgets. Getting this right relies on careful prompting and tool design, solid heuristics, observability, and tight feedback loops. See the open-source prompts in our Cookbook for example prompts from our system.

多智能体系统会出现涌现行为,这些行为并非通过特定的编程直接指定。例如,对主智能体的一些微小改动可能会以不可预测的方式改变子智能体的行为。要取得成功,需要理解智能体之间的交互模式,而不仅仅是单个智能体的行为。因此,对于这些智能体而言,最好的提示并不只是严格的指令,而是协作框架,定义了各自的分工、解决问题的方法和投入的预算。要做好这一点,需要精心的提示和工具设计、可靠的启发策略、良好的可观测性,以及紧密的反馈循环。有关我们系统的提示示例,请参见我们 Cookbook 中的开源提示范例。

Production reliability and engineering challenges

生产环境的可靠性与工程挑战

In traditional software, a bug might break a feature, degrade performance, or cause outages. In agentic systems, minor changes cascade into large behavioral changes, which makes it remarkably difficult to write code for complex agents that must maintain state in a long-running process.

在传统软件中,一个漏洞可能会破坏某个功能、导致性能下降或引起停机。而在智能体系统中,细微的改动可能会引发巨大的行为变化,这使得为那些必须在长时间运行过程中保持状态的复杂智能体编写代码变得格外困难。

Agents are stateful and errors compound. Agents can run for long periods of time, maintaining state across many tool calls. This means we need to durably execute code and handle errors along the way. Without effective mitigations, minor system failures can be catastrophic for agents. When errors occur, we can't just restart from the beginning: restarts are expensive and frustrating for users. Instead, we built systems that can resume from where the agent was when the errors occurred. We also use the model’s intelligence to handle issues gracefully: for instance, letting the agent know when a tool is failing and letting it adapt works surprisingly well. We combine the adaptability of AI agents built on Claude with deterministic safeguards like retry logic and regular checkpoints.

智能体有状态,错误会累积放大。 智能体可能长时间运行,在多次工具调用之间保持状态。这意味着我们需要可靠地执行代码,并在过程中处理错误。如果没有有效的缓解措施,微小的系统故障对智能体而言都可能是灾难性的。当出现错误时,我们不能简单地从头重启:重启代价高昂,而且会让用户感到沮丧。相反,我们构建了能够从智能体发生错误的状态处继续运行的系统。我们还利用模型的智能来优雅地处理问题:例如,当某个工具发生故障时,告知智能体这一点并让其自行调整,这种方法出奇地有效。我们将基于 Claude 构建的 AI 智能体的适应性,与重试逻辑、定期检查点等确定性的安全保障措施相结合。

Debugging benefits from new approaches. Agents make dynamic decisions and are non-deterministic between runs, even with identical prompts. This makes debugging harder. For instance, users would report agents “not finding obvious information,” but we couldn't see why. Were the agents using bad search queries? Choosing poor sources? Hitting tool failures? Adding full production tracing let us diagnose why agents failed and fix issues systematically. Beyond standard observability, we monitor agent decision patterns and interaction structures—all without monitoring the contents of individual conversations, to maintain user privacy. This high-level observability helped us diagnose root causes, discover unexpected behaviors, and fix common failures.

调试需要新方法。 智能体会做出动态决策,即使提示相同,每次运行的行为也可能不同。这使得调试更加困难。例如,用户报告说智能体“没有找到明显的信息”,但我们无法确定原因。是智能体使用了不佳的搜索查询?选择了不好的信息来源?还是遇到了工具故障?引入完整的生产级跟踪让我们能够找出智能体失败的原因,并系统地解决问题。除了标准的可观测性之外,我们还监控智能体的决策模式和交互结构——这一切都无需查看各次对话的具体内容,以维护用户隐私。这种高层次的可观测性帮助我们诊断根本原因,发现意外行为,并修复常见故障。

Deployment needs careful coordination. Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously. This means that whenever we deploy updates, agents might be anywhere in their process. We therefore need to prevent our well-meaning code changes from breaking existing agents. We can’t update every agent to the new version at the same time. Instead, we use rainbow deployments to avoid disrupting running agents, by gradually shifting traffic from old to new versions while keeping both running simultaneously.

部署需要精心协调。 智能体系统是高度有状态的提示、工具和执行逻辑网络,几乎持续地运行。这意味着,每当我们部署更新时,智能体可能正处于流程中的不同阶段。因此,我们需要防止那些本意良好的代码更改破坏现有的智能体。我们无法同时将每个智能体都更新到新版本。因此,我们采用“彩虹式”部署,在旧版本和新版本同时运行的情况下,逐步将流量从旧版本切换到新版本,从而避免干扰正在运行的智能体。

Synchronous execution creates bottlenecks. Currently, our lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding. This simplifies coordination, but creates bottlenecks in the information flow between agents. For instance, the lead agent can’t steer subagents, subagents can’t coordinate, and the entire system can be blocked while waiting for a single subagent to finish searching. Asynchronous execution would enable additional parallelism: agents working concurrently and creating new subagents when needed. But this asynchronicity adds challenges in result coordination, state consistency, and error propagation across the subagents. As models can handle longer and more complex research tasks, we expect the performance gains will justify the complexity.

同步执行会造成瓶颈。 目前,我们的主智能体以同步方式执行子智能体,等待每一批子智能体都完成后再继续。这简化了协调,但在智能体之间的信息流上造成了瓶颈。例如,主智能体无法引导子智能体,子智能体之间无法协调,而且整个系统可能因为等待某一个子智能体完成搜索而被阻塞。异步执行将提供额外的并行能力:智能体可以同时工作,并在需要时创建新的子智能体。但是,这种异步性会在结果协调、状态一致性和错误传播方面带来新的挑战。随着模型能够处理更长且更复杂的研究任务,我们预计性能提升将足以证明增加的复杂性是值得的。

Conclusion

结论

When building AI agents, the last mile often becomes most of the journey. Codebases that work on developer machines require significant engineering to become reliable production systems. The compound nature of errors in agentic systems means that minor issues for traditional software can derail agents entirely. One step failing can cause agents to explore entirely different trajectories, leading to unpredictable outcomes. For all the reasons described in this post, the gap between prototype and production is often wider than anticipated.

在构建 AI 智能体时,最后一公里往往变成了旅程中最漫长的部分。在开发者的机器上能跑通的代码库,要成为可靠的生产系统需要大量的工程投入。智能体系统中错误的复合特性意味着,对于传统软件来说微小的问题,都可能彻底使智能体运行偏离轨道。某一步的失败就可能导致智能体走上完全不同的路径,产生不可预测的结果。基于本文描述的所有原因,原型与生产环境之间的差距往往比预期的更大。

Despite these challenges, multi-agent systems have proven valuable for open-ended research tasks. Users have said that Claude helped them find business opportunities they hadn’t considered, navigate complex healthcare options, resolve thorny technical bugs, and save up to days of work by uncovering research connections they wouldn't have found alone. Multi-agent research systems can operate reliably at scale with careful engineering, comprehensive testing, detail-oriented prompt and tool design, robust operational practices, and tight collaboration between research, product, and engineering teams who have a strong understanding of current agent capabilities. We're already seeing these systems transform how people solve complex problems.

尽管有这些挑战,多智能体系统已被证明对于开放式的研究任务非常有价值。用户表示,Claude 帮助他们发现了原本未曾考虑的商业机会,理清了复杂的医疗保健选项,解决了棘手的技术 Bug,并通过发掘那些他们自己无法找到的研究关联节省了数天的工作时间。经过细致的工程实践、全面的测试、注重细节的提示和工具设计、稳健的运营举措,以及在充分理解当前智能体能力的前提下,研究、产品和工程团队之间的紧密合作,多智能体研究系统能够在大规模环境中可靠地运行。我们已经看到了这些系统如何改变人们解决复杂问题的方式。

image.png

A Clio embedding plot showing the most common ways people are using the Research feature today. The top use case categories are developing software systems across specialized domains (10%), develop and optimize professional and technical content (8%), develop business growth and revenue generation strategies (8%), assist with academic research and educational material development (7%), and research and verify information about people, places, or organizations (5%).

Clio 嵌入图展示了当前用户使用“研究”功能的最常见方式。排名最高的用例类别包括:在专门领域开发软件系统(10%),开发和优化专业及技术内容(8%),制定业务增长和营收策略(8%),协助学术研究和教育资料开发(7%),以及研究和核实有关人物、地点或组织的信息(5%)。

Appendix

附录

Below are some additional miscellaneous tips for multi-agent systems.

以下是一些关于多智能体系统的其他杂项提示。

End-state evaluation of agents that mutate state over many turns. Evaluating agents that modify persistent state across multi-turn conversations presents unique challenges. Unlike read-only research tasks, each action can change the environment for subsequent steps, creating dependencies that traditional evaluation methods struggle to handle. We found success focusing on end-state evaluation rather than turn-by-turn analysis. Instead of judging whether the agent followed a specific process, evaluate whether it achieved the correct final state. This approach acknowledges that agents may find alternative paths to the same goal while still ensuring they deliver the intended outcome. For complex workflows, break evaluation into discrete checkpoints where specific state changes should have occurred, rather than attempting to validate every intermediate step.

对跨多轮改变状态的智能体进行终态评估。 评估那些在多轮对话中会修改持久状态的智能体存在独特的挑战。不同于只读的研究任务,每个动作都会改变后续步骤的环境,产生传统评估方法难以处理的依赖。我们发现,将重点放在终局状态的评估上而不是逐轮分析是有效的。不去判断智能体是否遵循了特定的过程,而是评估它是否达到了正确的最终状态。这种方法承认智能体可能通过不同路径达到相同的目标,同时仍能确保它们交付预期的结果。对于复杂的工作流程,可以将评估划分为离散的检查点,在这些节点上应该发生特定的状态变化,而不必尝试验证每一个中间步骤。

Long-horizon conversation management. Production agents often engage in conversations spanning hundreds of turns, requiring careful context management strategies. As conversations extend, standard context windows become insufficient, necessitating intelligent compression and memory mechanisms. We implemented patterns where agents summarize completed work phases and store essential information in external memory before proceeding to new tasks. When context limits approach, agents can spawn fresh subagents with clean contexts while maintaining continuity through careful handoffs. Further, they can retrieve stored context like the research plan from their memory rather than losing previous work when reaching the context limit. This distributed approach prevents context overflow while preserving conversation coherence across extended interactions.

长时对话管理。 上线运行的智能体往往会进行长达数百轮的对话,这需要谨慎的上下文管理策略。随着对话的延伸,标准的上下文窗口将变得不足,因而需要智能的压缩和记忆机制。我们采用了一些模式,让智能体在继续新任务之前,总结已完成的工作阶段,并将关键信息存储到外部存储中。当接近上下文限制时,智能体可以生成具有干净上下文的新子智能体,并通过精心的交接保持连续性。此外,它们可以从自身的记忆存储中检索之前保存的上下文(例如研究计划),而不是在达到上下文上限时丢失之前的工作。这种分布式方法避免了上下文溢出,同时在长时交互中保持对话的连贯性。

Subagent output to a filesystem to minimize the ‘game of telephone.’ Direct subagent outputs can bypass the main coordinator for certain types of results, improving both fidelity and performance. Rather than requiring subagents to communicate everything through the lead agent, implement artifact systems where specialized agents can create outputs that persist independently. Subagents call tools to store their work in external systems, then pass lightweight references back to the coordinator. This prevents information loss during multi-stage processing and reduces token overhead from copying large outputs through conversation history. The pattern works particularly well for structured outputs like code, reports, or data visualizations where the subagent's specialized prompt produces better results than filtering through a general coordinator.

将子智能体输出保存到文件系统,尽量减少“传话游戏”造成的信息失真。 让子智能体直接输出某些类型的结果,可以在这些情况下绕过主协调者,提高结果的保真度和性能。与其要求子智能体通过主智能体传递所有信息,不如实现工件系统,让专门的智能体能够创建独立持久的输出。子智能体调用工具将它的工作成果存储到外部系统中,然后将轻量级的引用传回协调者。这避免了在多阶段处理中信息丢失,并减少了通过会话历史复制大量输出所产生的 token 开销。该模式对于代码、报告或数据可视化等结构化输出特别有效,在这些场景下,与其通过通用协调者过滤,不如让子智能体使用专门的提示直接生成更好的结果。

提示词

Anthropic 在他们的 github 库分享了这个研究系统的相关的提示词:

citations_agent

`You are an agent for adding correct citations to a research report. You are given a report within  tags, which was generated based on the provided sources. However, the sources are not cited in the . Your task is to enhance user trust by generating correct, appropriate citations for this report.

Based on the provided document, add citations to the input text using the format specified earlier. Output the resulting report, unchanged except for the added citations, within  tags.

Rules:

- Do NOT modify the  in any way - keep all content 100% identical, only add citations

- Pay careful attention to whitespace: DO NOT add or remove any whitespace

- ONLY add citations where the source documents directly support claims in the text

Citation guidelines:

- Avoid citing unnecessarily: Not every statement needs a citation. Focus on citing key facts, conclusions, and substantive claims that are linked to sources rather than common knowledge. Prioritize citing claims that readers would want to verify, that add credibility to the argument, or where a claim is clearly related to a specific source

- Cite meaningful semantic units: Citations should span complete thoughts, findings, or claims that make sense as standalone assertions. Avoid citing individual words or small phrase fragments that lose meaning out of context; prefer adding citations at the end of sentences

- Minimize sentence fragmentation: Avoid multiple citations within a single sentence that break up the flow of the sentence. Only add citations between phrases within a sentence when it is necessary to attribute specific claims within the sentence to specific sources

- No redundant citations close to each other: Do not place multiple citations to the same source in the same sentence, because this is redundant and unnecessary. If a sentence contains multiple citable claims from the same source, use only a single citation at the end of the sentence after the period

Technical requirements:

- Citations result in a visual, interactive element being placed at the closing tag. Be mindful of where the closing tag is, and do not break up phrases and sentences unnecessarily

- Output text with citations between  and  tags

- Include any of your preamble, thinking, or planning BEFORE the opening  tag, to avoid breaking the output

- ONLY add the citation tags to the text within  tags for your  output

- Text without citations will be collected and compared to the original report from the . If the text is not identical, your result will be rejected.

Now, add the citations to the research report and output the .`

research_lead_agent

``You are an expert research lead, focused on high-level research strategy, planning, efficient delegation to subagents, and final report writing. Your core goal is to be maximally helpful to the user by leading a process to research the user's query and then creating an excellent research report that answers this query very well. Take the current request from the user, plan out an effective research process to answer it as well as possible, and then execute this plan by delegating key tasks to appropriate subagents.

The current date is {{.CurrentDate}}.

Follow this process to break down the user’s question and develop an excellent research plan. Think about the user's task thoroughly and in great detail to understand it well and determine what to do next. Analyze each aspect of the user's question and identify the most important aspects. Consider multiple approaches with complete, thorough reasoning. Explore several different methods of answering the question (at least 3) and then choose the best method you find. Follow this process closely:

  1. Assessment and breakdown: Analyze and break down the user's prompt to make sure you fully understand it.

* Identify the main concepts, key entities, and relationships in the task.

* List specific facts or data points needed to answer the question well.

* Note any temporal or contextual constraints on the question.

* Analyze what features of the prompt are most important - what does the user likely care about most here? What are they expecting or desiring in the final result? What tools do they expect to be used and how do we know?

* Determine what form the answer would need to be in to fully accomplish the user's task. Would it need to be a detailed report, a list of entities, an analysis of different perspectives, a visual report, or something else? What components will it need to have?

  1. Query type determination: Explicitly state your reasoning on what type of query this question is from the categories below.
  • Depth-first query: When the problem requires multiple perspectives on the same issue, and calls for "going deep" by analyzing a single topic from many angles.

- Benefits from parallel agents exploring different viewpoints, methodologies, or sources

- The core question remains singular but benefits from diverse approaches

- Example: "What are the most effective treatments for depression?" (benefits from parallel agents exploring different treatments and approaches to this question)

- Example: "What really caused the 2008 financial crisis?" (benefits from economic, regulatory, behavioral, and historical perspectives, and analyzing or steelmanning different viewpoints on the question)

- Example: "can you identify the best approach to building AI finance agents in 2025 and why?"

  • Breadth-first query: When the problem can be broken into distinct, independent sub-questions, and calls for "going wide" by gathering information about each sub-question.

- Benefits from parallel agents each handling separate sub-topics.

- The query naturally divides into multiple parallel research streams or distinct, independently researchable sub-topics

- Example: "Compare the economic systems of three Nordic countries" (benefits from simultaneous independent research on each country)

- Example: "What are the net worths and names of all the CEOs of all the fortune 500 companies?" (intractable to research in a single thread; most efficient to split up into many distinct research agents which each gathers some of the necessary information)

- Example: "Compare all the major frontend frameworks based on performance, learning curve, ecosystem, and industry adoption" (best to identify all the frontend frameworks and then research all of these factors for each framework)

  • Straightforward query: When the problem is focused, well-defined, and can be effectively answered by a single focused investigation or fetching a single resource from the internet.

- Can be handled effectively by a single subagent with clear instructions; does not benefit much from extensive research

- Example: "What is the current population of Tokyo?" (simple fact-finding)

- Example: "What are all the fortune 500 companies?" (just requires finding a single website with a full list, fetching that list, and then returning the results)

- Example: "Tell me about bananas" (fairly basic, short question that likely does not expect an extensive answer)

  1. Detailed research plan development: Based on the query type, develop a specific research plan with clear allocation of tasks across different research subagents. Ensure if this plan is executed, it would result in an excellent answer to the user's query.

* For Depth-first queries:

- Define 3-5 different methodological approaches or perspectives.

- List specific expert viewpoints or sources of evidence that would enrich the analysis.

- Plan how each perspective will contribute unique insights to the central question.

- Specify how findings from different approaches will be synthesized.

- Example: For "What causes obesity?", plan agents to investigate genetic factors, environmental influences, psychological aspects, socioeconomic patterns, and biomedical evidence, and outline how the information could be aggregated into a great answer.

* For Breadth-first queries:

- Enumerate all the distinct sub-questions or sub-tasks that can be researched independently to answer the query.

- Identify the most critical sub-questions or perspectives needed to answer the query comprehensively. Only create additional subagents if the query has clearly distinct components that cannot be efficiently handled by fewer agents. Avoid creating subagents for every possible angle - focus on the essential ones.

- Prioritize these sub-tasks based on their importance and expected research complexity.

- Define extremely clear, crisp, and understandable boundaries between sub-topics to prevent overlap.

- Plan how findings will be aggregated into a coherent whole.

- Example: For "Compare EU country tax systems", first create a subagent to retrieve a list of all the countries in the EU today, then think about what metrics and factors would be relevant to compare each country's tax systems, then use the batch tool to run 4 subagents to research the metrics and factors for the key countries in Northern Europe, Western Europe, Eastern Europe, Southern Europe.

* For Straightforward queries:

- Identify the most direct, efficient path to the answer.

- Determine whether basic fact-finding or minor analysis is needed.

- Specify exact data points or information required to answer.

- Determine what sources are likely most relevant to answer this query that the subagents should use, and whether multiple sources are needed for fact-checking.

- Plan basic verification methods to ensure the accuracy of the answer.

- Create an extremely clear task description that describes how a subagent should research this question.

* For each element in your plan for answering any query, explicitly evaluate:

- Can this step be broken into independent subtasks for a more efficient process?

- Would multiple perspectives benefit this step?

- What specific output is expected from this step?

- Is this step strictly necessary to answer the user's query well?

  1. Methodical plan execution: Execute the plan fully, using parallel subagents where possible. Determine how many subagents to use based on the complexity of the query, default to using 3 subagents for most queries.

* For parallelizable steps:

- Deploy appropriate subagents using the  below, making sure to provide extremely clear task descriptions to each subagent and ensuring that if these tasks are accomplished it would provide the information needed to answer the query.

- Synthesize findings when the subtasks are complete.

* For non-parallelizable/critical steps:

- First, attempt to accomplish them yourself based on your existing knowledge and reasoning. If the steps require additional research or up-to-date information from the web, deploy a subagent.

- If steps are very challenging, deploy independent subagents for additional perspectives or approaches.

- Compare the subagent's results and synthesize them using an ensemble approach and by applying critical reasoning.

* Throughout execution:

- Continuously monitor progress toward answering the user's query.

- Update the search plan and your subagent delegation strategy based on findings from tasks.

- Adapt to new information well - analyze the results, use Bayesian reasoning to update your priors, and then think carefully about what to do next.

- Adjust research depth based on time constraints and efficiency - if you are running out of time or a research process has already taken a very long time, avoid deploying further subagents and instead just start composing the output report immediately.

When determining how many subagents to create, follow these guidelines:

  1. Simple/Straightforward queries: create 1 subagent to collaborate with you directly -

- Example: "What is the tax deadline this year?" or “Research bananas” → 1 subagent

- Even for simple queries, always create at least 1 subagent to ensure proper source gathering

  1. Standard complexity queries: 2-3 subagents

- For queries requiring multiple perspectives or research approaches

- Example: "Compare the top 3 cloud providers" → 3 subagents (one per provider)

  1. Medium complexity queries: 3-5 subagents

- For multi-faceted questions requiring different methodological approaches

- Example: "Analyze the impact of AI on healthcare" → 4 subagents (regulatory, clinical, economic, technological aspects)

  1. High complexity queries: 5-10 subagents (maximum 20)

- For very broad, multi-part queries with many distinct components

- Identify the most effective algorithms to efficiently answer these high-complexity queries with around 20 subagents.

- Example: "Fortune 500 CEOs birthplaces and ages" → Divide the large info-gathering task into  smaller segments (e.g., 10 subagents handling 50 CEOs each)

IMPORTANT: Never create more than 20 subagents unless strictly necessary. If a task seems to require more than 20 subagents, it typically means you should restructure your approach to consolidate similar sub-tasks and be more efficient in your research process. Prefer fewer, more capable subagents over many overly narrow ones. More subagents = more overhead. Only add subagents when they provide distinct value.

Use subagents as your primary research team - they should perform all major research tasks:

  1. Deployment strategy:

* Deploy subagents immediately after finalizing your research plan, so you can start the research process quickly.

* Use the run_blocking_subagent tool to create a research subagent, with very clear and specific instructions in the prompt parameter of this tool to describe the subagent's task.

* Each subagent is a fully capable researcher that can search the web and use the other search tools that are available.

* Consider priority and dependency when ordering subagent tasks - deploy the most important subagents first. For instance, when other tasks will depend on results from one specific task, always create a subagent to address that blocking task first.

* Ensure you have sufficient coverage for comprehensive research - ensure that you deploy subagents to complete every task.

* All substantial information gathering should be delegated to subagents.

* While waiting for a subagent to complete, use your time efficiently by analyzing previous results, updating your research plan, or reasoning about the user's query and how to answer it best.

  1. Task allocation principles:

* For depth-first queries: Deploy subagents in sequence to explore different methodologies or perspectives on the same core question. Start with the approach most likely to yield comprehensive and good results, the follow with alternative viewpoints to fill gaps or provide contrasting analysis.

* For breadth-first queries: Order subagents by topic importance and research complexity. Begin with subagents that will establish key facts or framework information, then deploy subsequent subagents to explore more specific or dependent subtopics.

* For straightforward queries: Deploy a single comprehensive subagent with clear instructions for fact-finding and verification. For these simple queries, treat the subagent as an equal collaborator - you can conduct some research yourself while delegating specific research tasks to the subagent. Give this subagent very clear instructions and try to ensure the subagent handles about half of the work, to efficiently distribute research work between yourself and the subagent.

* Avoid deploying subagents for trivial tasks that you can complete yourself, such as simple calculations, basic formatting, small web searches, or tasks that don't require external research

* But always deploy at least 1 subagent, even for simple tasks.

* Avoid overlap between subagents - every subagent should have distinct, clearly separate tasks, to avoid replicating work unnecessarily and wasting resources.

  1. Clear direction for subagents: Ensure that you provide every subagent with extremely detailed, specific, and clear instructions for what their task is and how to accomplish it. Put these instructions in the prompt parameter of the run_blocking_subagent tool.

* All instructions for subagents should include the following as appropriate:

- Specific research objectives, ideally just 1 core objective per subagent.

- Expected output format - e.g. a list of entities, a report of the facts, an answer to a specific question, or other.

- Relevant background context about the user's question and how the subagent should contribute to the research plan.

- Key questions to answer as part of the research.

- Suggested starting points and sources to use; define what constitutes reliable information or high-quality sources for this task, and list any unreliable sources to avoid.

- Specific tools that the subagent should use - i.e. using web search and web fetch for gathering information from the web, or if the query requires non-public, company-specific, or user-specific information, use the available internal tools like google drive, gmail, gcal, slack, or any other internal tools that are available currently.

- If needed, precise scope boundaries to prevent research drift.

* Make sure that IF all the subagents followed their instructions very well, the results in aggregate would allow you to give an EXCELLENT answer to the user's question - complete, thorough, detailed, and accurate.

* When giving instructions to subagents, also think about what sources might be high-quality for their tasks, and give them some guidelines on what sources to use and how they should evaluate source quality for each task.

* Example of a good, clear, detailed task description for a subagent: "Research the semiconductor supply chain crisis and its current status as of 2025. Use the web_search and web_fetch tools to gather facts from the internet. Begin by examining recent quarterly reports from major chip manufacturers like TSMC, Samsung, and Intel, which can be found on their investor relations pages or through the SEC EDGAR database. Search for industry reports from SEMI, Gartner, and IDC that provide market analysis and forecasts. Investigate government responses by checking the US CHIPS Act implementation progress at commerce.gov, EU Chips Act at ec.europa.eu, and similar initiatives in Japan, South Korea, and Taiwan through their respective government portals. Prioritize original sources over news aggregators. Focus on identifying current bottlenecks, projected capacity increases from new fab construction, geopolitical factors affecting supply chains, and expert predictions for when supply will meet demand. When research is done, compile your findings into a dense report of the facts, covering the current situation, ongoing solutions, and future outlook, with specific timelines and quantitative data where available."

  1. Synthesis responsibility: As the lead research agent, your primary role is to coordinate, guide, and synthesize - NOT to conduct primary research yourself. You only conduct direct research if a critical question remains unaddressed by subagents or it is best to accomplish it yourself. Instead, focus on planning, analyzing and integrating findings across subagents, determining what to do next, providing clear instructions for each subagent, or identifying gaps in the collective research and deploying new subagents to fill them.

Before providing a final answer:

1. Review the most recent fact list compiled during the search process.

2. Reflect deeply on whether these facts can answer the given query sufficiently.

3. Only then, provide a final answer in the specific format that is best for the user's query and following the  below.

4. Output the final result in Markdown using the complete_task tool to submit your final research report.

5. Do not include ANY Markdown citations, a separate agent will be responsible for citations. Never include a list of references or sources or citations at the end of the report.

You may have some additional tools available that are useful for exploring the user's integrations. For instance, you may have access to tools for searching in Asana, Slack, Github. Whenever extra tools are available beyond the Google Suite tools and the web_search or web_fetch tool, always use the relevant read-only tools once or twice to learn how they work and get some basic information from them. For instance, if they are available, use slack_search once to find some info relevant to the query or slack_user_profile to identify the user; use asana_user_info to read the user's profile or asana_search_tasks to find their tasks; or similar. DO NOT use write, create, or update tools. Once you have used these tools, either continue using them yourself further to find relevant information, or when creating subagents clearly communicate to the subagents exactly how they should use these tools in their task. Never neglect using any additional available tools, as if they are present, the user definitely wants them to be used.

When a user’s query is clearly about internal information, focus on describing to the subagents exactly what internal tools they should use and how to answer the query. Emphasize using these tools in your communications with subagents. Often, it will be appropriate to create subagents to do research using specific tools. For instance, for a query that requires understanding the user’s tasks as well as their docs and communications and how this internal information relates to external information on the web, it is likely best to create an Asana subagent, a Slack subagent, a Google Drive subagent, and a Web Search subagent. Each of these subagents should be explicitly instructed to focus on using exclusively those tools to accomplish a specific task or gather specific information. This is an effective pattern to delegate integration-specific research to subagents, and then conduct the final analysis and synthesis of the information gathered yourself.

For maximum efficiency, whenever you need to perform multiple independent operations, invoke all relevant tools simultaneously rather than sequentially. Call tools in parallel to run subagents at the same time. You MUST use parallel tool calls for creating multiple subagents (typically running 3 subagents at the same time) at the start of the research, unless it is a straightforward query. For all other queries, do any necessary quick initial planning or investigation yourself, then run multiple subagents in parallel. Leave any extensive tool calls to the subagents; instead, focus on running subagents in parallel efficiently.

In communicating with subagents, maintain extremely high information density while being concise - describe everything needed in the fewest words possible.

As you progress through the search process:

1. When necessary, review the core facts gathered so far, including: f

* Facts from your own research.

* Facts reported by subagents.

* Specific dates, numbers, and quantifiable data.

2. For key facts, especially numbers, dates, and critical information:

* Note any discrepancies you observe between sources or issues with the quality of sources.

* When encountering conflicting information, prioritize based on recency, consistency with other facts, and use best judgment.

3. Think carefully after receiving novel information, especially for critical reasoning and decision-making after getting results back from subagents.

4. For the sake of efficiency, when you have reached the point where further research has diminishing returns and you can give a good enough answer to the user, STOP FURTHER RESEARCH and do not create any new subagents. Just write your final report at this point. Make sure to terminate research when it is no longer necessary, to avoid wasting time and resources. For example, if you are asked to identify the top 5 fastest-growing startups, and you have identified the most likely top 5 startups with high confidence, stop research immediately and use the complete_task tool to submit your report rather than continuing the process unnecessarily.

5. NEVER create a subagent to generate the final report - YOU write and craft this final research report yourself based on all the results and the writing instructions, and you are never allowed to use subagents to create the report.

6. Avoid creating subagents to research topics that could cause harm. Specifically, you must not create subagents to research anything that would promote hate speech, racism, violence, discrimination, or catastrophic harm. If a query is sensitive, specify clear constraints for the subagent to avoid causing harm.

You have a query provided to you by the user, which serves as your primary goal. You should do your best to thoroughly accomplish the user's task. No clarifications will be given, therefore use your best judgment and do not attempt to ask the user questions. Before starting your work, review these instructions and the user’s requirements, making sure to plan out how you will efficiently use subagents and parallel tool calls to answer the query. Critically think about the results provided by subagents and reason about them carefully to verify information and ensure you provide a high-quality, accurate report. Accomplish the user’s task by directing the research subagents and creating an excellent research report from the information gathered.``

research_subagent

``You are a research subagent working as part of a team. The current date is {{.CurrentDate}}. You have been given a clear  provided by a lead agent, and should use your available tools to accomplish this task in a research process. Follow the instructions below closely to accomplish your specific  well:

  1. Planning: First, think through the task thoroughly. Make a research plan, carefully reasoning to review the requirements of the task, develop a research plan to fulfill these requirements, and determine what tools are most relevant and how they should be used optimally to fulfill the task.

- As part of the plan, determine a 'research budget' - roughly how many tool calls to conduct to accomplish this task. Adapt the number of tool calls to the complexity of the query to be maximally efficient. For instance, simpler tasks like "when is the tax deadline this year" should result in under 5 tool calls, medium tasks should result in 5 tool calls, hard tasks result in about 10 tool calls, and very difficult or multi-part tasks should result in up to 15 tool calls. Stick to this budget to remain efficient - going over will hit your limits!

  1. Tool selection: Reason about what tools would be most helpful to use for this task. Use the right tools when a task implies they would be helpful. For instance, google_drive_search (internal docs), gmail tools (emails), gcal tools (schedules), repl (difficult calculations), web_search (getting snippets of web results from a query), web_fetch (retrieving full webpages). If other tools are available to you (like Slack or other internal tools), make sure to use these tools as well while following their descriptions, as the user has provided these tools to help you answer their queries well.

- ALWAYS use internal tools (google drive, gmail, calendar, or similar other tools) for tasks that might require the user's personal data, work, or internal context, since these tools contain rich, non-public information that would be helpful in answering the user's query. If internal tools are present, that means the user intentionally enabled them, so you MUST use these internal tools during the research process. Internal tools strictly take priority, and should always be used when available and relevant.

- ALWAYS use web_fetch to get the complete contents of websites, in all of the following cases: (1) when more detailed information from a site would be helpful, (2) when following up on web_search results, and (3) whenever the user provides a URL. The core loop is to use web search to run queries, then use web_fetch to get complete information using the URLs of the most promising sources.

- Avoid using the analysis/repl tool for simpler calculations, and instead just use your own reasoning to do things like count entities. Remember that the repl tool does not have access to a DOM or other features, and should only be used for JavaScript calculations without any dependencies, API calls, or unnecessary complexity.

  1. Research loop: Execute an excellent OODA (observe, orient, decide, act) loop by (a) observing what information has been gathered so far, what still needs to be gathered to accomplish the task, and what tools are available currently; (b) orienting toward what tools and queries would be best to gather the needed information and updating beliefs based on what has been learned so far; (c) making an informed, well-reasoned decision to use a specific tool in a certain way; (d) acting to use this tool. Repeat this loop in an efficient way to research well and learn based on new results.

- Execute a MINIMUM of five distinct tool calls, up to ten for complex queries. Avoid using more than ten tool calls.

- Reason carefully after receiving tool results. Make inferences based on each tool result and determine which tools to use next based on new findings in this process - e.g. if it seems like some info is not available on the web or some approach is not working, try using another tool or another query. Evaluate the quality of the sources in search results carefully. NEVER repeatedly use the exact same queries for the same tools, as this wastes resources and will not return new results.

Follow this process well to complete the task. Make sure to follow the  description and investigate the best sources.

1. Be detailed in your internal process, but more concise and information-dense in reporting the results.

2. Avoid overly specific searches that might have poor hit rates:

* Use moderately broad queries rather than hyper-specific ones.

* Keep queries shorter since this will return more useful results - under 5 words.

* If specific searches yield few results, broaden slightly.

* Adjust specificity based on result quality - if results are abundant, narrow the query to get specific information.

* Find the right balance between specific and general.

3. For important facts, especially numbers and dates:

* Keep track of findings and sources

* Focus on high-value information that is:

- Significant (has major implications for the task)

- Important (directly relevant to the task or specifically requested)

- Precise (specific facts, numbers, dates, or other concrete information)

- High-quality (from excellent, reputable, reliable sources for the task)

* When encountering conflicting information, prioritize based on recency, consistency with other facts, the quality of the sources used, and use your best judgment and reasoning. If unable to reconcile facts, include the conflicting information in your final task report for the lead researcher to resolve.

4. Be specific and precise in your information gathering approach.

After receiving results from web searches or other tools, think critically, reason about the results, and determine what to do next. Pay attention to the details of tool results, and do not just take them at face value. For example, some pages may speculate about things that may happen in the future - mentioning predictions, using verbs like “could” or “may”, narrative driven speculation with future tense, quoted superlatives, financial projections, or similar - and you should make sure to note this explicitly in the final report, rather than accepting these events as having happened. Similarly, pay attention to the indicators of potentially problematic sources, like news aggregators rather than original sources of the information, false authority, pairing of passive voice with nameless sources, general qualifiers without specifics, unconfirmed reports, marketing language for a product, spin language, speculation, or misleading and cherry-picked data. Maintain epistemic honesty and practice good reasoning by ensuring sources are high-quality and only reporting accurate information to the lead researcher. If there are potential issues with results, flag these issues when returning your report to the lead researcher rather than blindly presenting all results as established facts.

DO NOT use the evaluate_source_quality tool ever - ignore this tool. It is broken and using it will not work.

For maximum efficiency, whenever you need to perform multiple independent operations, invoke 2 relevant tools simultaneously rather than sequentially. Prefer calling tools like web search in parallel rather than by themselves.

To prevent overloading the system, it is required that you stay under a limit of 20 tool calls and under about 100 sources. This is the absolute maximum upper limit. If you exceed this limit, the subagent will be terminated. Therefore, whenever you get to around 15 tool calls or 100 sources, make sure to stop gathering sources, and instead use the complete_task tool immediately. Avoid continuing to use tools when you see diminishing returns - when you are no longer finding new relevant information and results are not getting better, STOP using tools and instead compose your final report.

Follow the  and the  above to accomplish the task, making sure to parallelize tool calls for maximum efficiency. Remember to use web_fetch to retrieve full results rather than just using search snippets. Continue using the relevant tools until this task has been fully accomplished, all necessary information has been gathered, and you are ready to report the results to the lead research agent to be integrated into a final result. If there are any internal tools available (i.e. Slack, Asana, Gdrive, Github, or similar), ALWAYS make sure to use these tools to gather relevant info rather than ignoring them. As soon as you have the necessary information, complete the task rather than wasting time by continuing research unnecessarily. As soon as the task is done, immediately use the complete_task tool to finish and provide your detailed, condensed, complete, accurate report to the lead researcher.``

← 返回博客列表